Search | VHL Regional Portal

1.

Self-Supervised Contrastive Molecular Representation Learning with a Chemical Synthesis Knowledge Graph.

Xie, Jiancong; Wang, Yi; Rao, Jiahua; Zheng, Shuangjia; Yang, Yuedong.

J Chem Inf Model ; 64(6): 1945-1954, 2024 Mar 25.

Article in English | MEDLINE | ID: mdl-38484468

ABSTRACT

Self-supervised molecular representation learning has demonstrated great promise in bridging machine learning and chemical science to accelerate the development of new drugs. Due to the limited reaction data, existing methods are mostly pretrained by augmenting the intrinsic topology of molecules without effectively incorporating chemical reaction prior information, which makes them difficult to generalize to chemical reaction-related tasks. To address this issue, we propose ReaKE, a reaction knowledge embedding framework, which formulates chemical reactions as a knowledge graph. Specifically, we constructed a chemical synthesis knowledge graph with reactants and products as nodes and reaction rules as the edges. Based on the knowledge graph, we further proposed novel contrastive learning at both molecule and reaction levels to capture the reaction-related functional group information within and between molecules. Extensive experiments demonstrate the effectiveness of ReaKE compared with state-of-the-art methods on several downstream tasks, including reaction classification, product prediction, and yield prediction.

Subjects

Machine Learning; Pattern Recognition, Automated
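
The knowledge graph described in the abstract above (reactants and products as nodes, reaction rules as edges) can be sketched as a list of triples. This is a minimal illustration only: the molecules, the rule labels, and the helper names `build_graph` and `reactants_by_rule` are hypothetical, and treating molecules that share a rule as candidate positive pairs is an assumption made here, not the sampling scheme of ReaKE.

```python
from collections import defaultdict

# Hypothetical toy triples: (reactant SMILES, reaction rule label, product SMILES).
# Molecules and rule labels are illustrative and not taken from the paper.
TRIPLES = [
    ("CC(=O)O", "esterification", "CC(=O)OC"),
    ("CO", "esterification", "CC(=O)OC"),
    ("CC(=O)Cl", "amidation", "CC(=O)NC"),
    ("CN", "amidation", "CC(=O)NC"),
]


def build_graph(triples):
    """Index outgoing edges: reactant -> list of (rule, product)."""
    graph = defaultdict(list)
    for reactant, rule, product in triples:
        graph[reactant].append((rule, product))
    return dict(graph)


def reactants_by_rule(triples, rule):
    """Return the sorted reactants that take part in a given reaction rule.

    Assumption: such molecules could serve as candidate positive pairs
    in a contrastive objective; the paper's actual scheme may differ.
    """
    return sorted({r for r, edge_rule, _ in triples if edge_rule == rule})


graph = build_graph(TRIPLES)
```

Calling `reactants_by_rule(TRIPLES, "amidation")` lists the reactants connected by the same edge type, which is the kind of relation a reaction-level contrastive objective could exploit.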

2.

Predicting disease-gene associations through self-supervised mutual infomax graph convolution network.

Xie, Jiancong; Rao, Jiahua; Xie, Junjie; Zhao, Huiying; Yang, Yuedong.

Comput Biol Med ; 170: 108048, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38310804

ABSTRACT

Illuminating associations between diseases and genes can help reveal the pathogenesis of syndromes and contribute to treatments, but a large number of associations remain unexplored. To identify novel disease-gene associations, many computational methods have been developed using disease and gene-related prior knowledge. However, these methods remain of relatively inferior performance due to the limited external data sources and the inevitable noise among the prior knowledge. In this study, we have developed a new method, Self-Supervised Mutual Infomax Graph Convolution Network (MiGCN), to predict disease-gene associations under the guidance of external disease-disease and gene-gene collaborative graphs. The noise within the collaborative graphs was eliminated by maximizing the mutual information between nodes and neighbors through a graphical mutual infomax layer. In parallel, the node interactions were strengthened by a novel informative message passing layer to improve the learning ability of the graph neural network. The extensive experiments showed that our model achieved performance improvement over the state-of-the-art method by more than 8% on AUC. The datasets, source codes and trained models of MiGCN are available at https://github.com/biomed-AI/MiGCN.

Subjects

Learning; Neural Networks, Computer; Humans; Software; Syndrome

3.

Gene based message passing for drug repurposing.

Wang, Yuxing; Li, Zhiyang; Rao, Jiahua; Yang, Yuedong; Dai, Zhiming.

iScience ; 26(9): 107663, 2023 Sep 15.

Article in English | MEDLINE | ID: mdl-37670781

ABSTRACT

The medicinal effect of a drug acts through a series of genes, and the pathological mechanism of a disease is also related to genes with certain biological functions. However, the complex information between a drug or disease and a series of genes is neglected by traditional message passing methods. In this study, we proposed a new framework using two different strategies for gene-drug/disease and drug-disease networks, respectively. We employ a long short-term memory (LSTM) network to extract the flow of messages from a series of genes (gene path) to a drug/disease. Incorporating the resulting information of gene paths into the drug-disease network, we utilize a graph convolutional network (GCN) to predict drug-disease associations. Experimental results showed that our method GeneDR (gene-based drug repurposing) makes better use of the information in gene paths, and performs better in predicting drug-disease associations.

4.

Integrating supercomputing and artificial intelligence for life science.

Rao, Jiahua; Zheng, Shuangjia; Yang, Yuedong.

Patterns (N Y) ; 3(12): 100653, 2022 Dec 09.

Article in English | MEDLINE | ID: mdl-36569549

ABSTRACT

Jiahua Rao and Shuangjia Zheng are Ph.D. students in Prof. Yang's lab (Supercomputing And AI for Life science, SAIL Lab) at Sun Yat-sen University. They recently developed an interpretable framework to quantitatively assess the interpretability of Graph Neural Network (GNN) and made comparison with medicinal chemists. Their meaningful benchmarking and rigorous framework would greatly benefit development of new interpretable methods in GNNs.

5.

Quantitative evaluation of explainable graph neural networks for molecular property prediction.

Rao, Jiahua; Zheng, Shuangjia; Lu, Yutong; Yang, Yuedong.

Patterns (N Y) ; 3(12): 100628, 2022 Dec 09.

Article in English | MEDLINE | ID: mdl-36569553

ABSTRACT

Graph neural networks (GNNs) have received increasing attention because of their expressive power on topological data, but they are still criticized for their lack of interpretability. To interpret GNN models, explainable artificial intelligence (XAI) methods have been developed. However, these methods are limited to qualitative analyses without quantitative assessments from the real-world datasets due to a lack of ground truths. In this study, we have established five XAI-specific molecular property benchmarks, including two synthetic and three experimental datasets. Through the datasets, we quantitatively assessed six XAI methods on four GNN models and made comparisons with seven medicinal chemists of different experience levels. The results demonstrated that XAI methods could deliver reliable and informative answers for medicinal chemists in identifying the key substructures. Moreover, the identified substructures were shown to complement existing classical fingerprints to improve molecular property predictions, and the improvements increased with the growth of training data.

6.

AlphaFold2-aware protein-DNA binding site prediction using graph transformer.

Yuan, Qianmu; Chen, Sheng; Rao, Jiahua; Zheng, Shuangjia; Zhao, Huiying; Yang, Yuedong.

Brief Bioinform ; 23(2), 2022 Mar 10.

Article in English | MEDLINE | ID: mdl-35039821

ABSTRACT

Protein-DNA interactions play crucial roles in the biological systems, and identifying protein-DNA binding sites is the first step for mechanistic understanding of various biological activities (such as transcription and repair) and designing novel drugs. How to accurately identify DNA-binding residues from only protein sequence remains a challenging task. Currently, most existing sequence-based methods only consider contextual features of the sequential neighbors, which are limited to capture spatial information. Based on the recent breakthrough in protein structure prediction by AlphaFold2, we propose an accurate predictor, GraphSite, for identifying DNA-binding residues based on the structural models predicted by AlphaFold2. Here, we convert the binding site prediction problem into a graph node classification task and employ a transformer-based variant model to take the protein structural information into account. By leveraging predicted protein structures and graph transformer, GraphSite substantially improves over the latest sequence-based and structure-based methods. The algorithm is further confirmed on the independent test set of 181 proteins, where GraphSite surpasses the state-of-the-art structure-based method by 16.4% in area under the precision-recall curve and 11.2% in Matthews correlation coefficient, respectively. We provide the datasets, the predicted structures and the source codes along with the pre-trained models of GraphSite at https://github.com/biomed-AI/GraphSite. The GraphSite web server is freely available at https://biomed.nscc-gz.cn/apps/GraphSite.

Subjects

Algorithms; Proteins; Binding Sites; DNA/metabolism; Protein Binding; Protein Domains; Proteins/chemistry
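
The GraphSite abstract above reports gains in the Matthews correlation coefficient (MCC) for per-residue binding-site labels. As a reading aid, the standard MCC definition for binary labels can be sketched as follows. The function name and the convention of returning 0.0 when the denominator vanishes are choices made here for illustration and are not taken from the GraphSite code.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = binding, 0 = non-binding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: report 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, MCC stays informative when binding residues are a small minority of all residues, which is the usual situation in binding-site prediction.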

7.

Integrating multi-omics data through deep learning for accurate cancer prognosis prediction.

Chai, Hua; Zhou, Xiang; Zhang, Zhongyue; Rao, Jiahua; Zhao, Huiying; Yang, Yuedong.

Comput Biol Med ; 134: 104481, 2021 Jul.

Article in English | MEDLINE | ID: mdl-33989895

ABSTRACT

BACKGROUND: Genomic information is nowadays widely used for precise cancer treatments. Since the individual type of omics data only represents a single view that suffers from data noise and bias, multiple types of omics data are required for accurate cancer prognosis prediction. However, it is challenging to effectively integrate multi-omics data due to the large number of redundant variables but relatively small sample size. With the recent progress in deep learning techniques, Autoencoder was used to integrate multi-omics data for extracting representative features. Nevertheless, the generated model is fragile from data noises. Additionally, previous studies usually focused on individual cancer types without making comprehensive tests on pan-cancer. Here, we employed the denoising Autoencoder to get a robust representation of the multi-omics data, and then used the learned representative features to estimate patients' risks. RESULTS: By applying to 15 cancers from The Cancer Genome Atlas (TCGA), our method was shown to improve the C-index values over previous methods by 6.5% on average. Considering the difficulty to obtain multi-omics data in practice, we further used only mRNA data to fit the estimated risks by training XGboost models, and found the models could achieve an average C-index value of 0.627. As a case study, the breast cancer prognosis prediction model was independently tested on three datasets from the Gene Expression Omnibus (GEO), and shown able to significantly separate high-risk patients from low-risk ones (C-index>0.6, p-values<0.05). Based on the risk subgroups divided by our method, we identified nine prognostic markers highly associated with breast cancer, among which seven genes have been proved by literature review. CONCLUSION: Our comprehensive tests indicated that we have constructed an accurate and robust framework to integrate multi-omics data for cancer prognosis prediction. Moreover, it is an effective way to discover cancer prognosis-related genes.

Subjects

Breast Neoplasms; Deep Learning; Breast Neoplasms/genetics; Female; Gene Expression Profiling; Genomics; Humans; Oncogenes
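
The prognosis results above are reported as C-index values. A minimal sketch of Harrell's concordance index for right-censored survival data is given below, under stated assumptions: the function and argument names are hypothetical, pairs with tied survival times are skipped, and tied risk scores count as half concordant. The paper's own evaluation code may handle ties differently.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data.

    times  -- observed follow-up times
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    risks  -- predicted risk scores (higher = expected shorter survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:  # subject i failed before subject j
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported averages such as 0.627.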

8.

Imputing single-cell RNA-seq data by combining graph convolution and autoencoder neural networks.

Rao, Jiahua; Zhou, Xiang; Lu, Yutong; Zhao, Huiying; Yang, Yuedong.

iScience ; 24(5): 102393, 2021 May 21.

Article in English | MEDLINE | ID: mdl-33997678

ABSTRACT

Single-cell RNA sequencing technology promotes the profiling of single-cell transcriptomes at an unprecedented throughput and resolution. However, in scRNA-seq studies, the low amount of sequenced mRNA in each cell leads to missing detection for a portion of mRNA molecules, i.e. the dropout problem, which hinders various downstream analyses. Therefore, it is necessary to develop robust and effective imputation methods for the increasing scRNA-seq data. In this study, we have developed an imputation method (GraphSCI) to impute the dropout events in scRNA-seq data based on graph convolution networks. Extensive experiments demonstrated that GraphSCI outperforms other state-of-the-art methods for imputation on both simulated and real scRNA-seq data. Meanwhile, GraphSCI is able to accurately infer gene-to-gene relationships, and the inferred gene-to-gene relationships could also provide powerful assistance for imputation dynamically during the training process, which is a key advantage of GraphSCI compared with other imputation algorithms.

9.

PharmKG: a dedicated knowledge graph benchmark for bomedical data mining.

Zheng, Shuangjia; Rao, Jiahua; Song, Ying; Zhang, Jixian; Xiao, Xianglu; Fang, Evandro Fei; Yang, Yuedong; Niu, Zhangming.

Brief Bioinform ; 22(4), 2021 Jul 20.

Article in English | MEDLINE | ID: mdl-33341877

ABSTRACT

Biomedical knowledge graphs (KGs), which can help with the understanding of complex biological systems and pathologies, have begun to play a critical role in medical practice and research. However, challenges remain in their embedding and use due to their complex nature and the specific demands of their construction. Existing studies often suffer from problems such as sparse and noisy datasets, insufficient modeling methods and non-uniform evaluation metrics. In this work, we established a comprehensive KG system for the biomedical field in an attempt to bridge the gap. Here, we introduced PharmKG, a multi-relational, attributed biomedical KG, composed of more than 500 000 individual interconnections between genes, drugs and diseases, with 29 relation types over a vocabulary of ~8000 disambiguated entities. Each entity in PharmKG is attached with heterogeneous, domain-specific information obtained from multi-omics data, i.e. gene expression, chemical structure and disease word embedding, while preserving the semantic and biomedical features. For baselines, we offered nine state-of-the-art KG embedding (KGE) approaches and a new biological, intuitive, graph neural network-based KGE method that uses a combination of both global network structure and heterogeneous domain features. Based on the proposed benchmark, we conducted extensive experiments to assess these KGE models using multiple evaluation metrics. Finally, we discussed our observations across various downstream biological tasks and provide insights and guidelines for how to use a KG in biomedicine. We hope that the unprecedented quality and diversity of PharmKG will lead to advances in biomedical KG construction, embedding and application.

Subjects

Biomedical Research; Data Mining; Neural Networks, Computer; Semantics; Software; Benchmarking; Humans

10.

Accurate prediction of genome-wide RNA secondary structure profile based on extreme gradient boosting.

Ke, Yaobin; Rao, Jiahua; Zhao, Huiying; Lu, Yutong; Xiao, Nong; Yang, Yuedong.

Bioinformatics ; 36(17): 4576-4582, 2020 Nov 01.

Article in English | MEDLINE | ID: mdl-32467966

ABSTRACT

MOTIVATION: RNA secondary structure plays a vital role in fundamental cellular processes, and identification of RNA secondary structure is a key step to understand RNA functions. Recently, a few experimental methods were developed to profile genome-wide RNA secondary structure, i.e. the pairing probability of each nucleotide, through high-throughput sequencing techniques. However, these high-throughput methods have low precision and cannot cover all nucleotides due to limited sequencing coverage. RESULTS: Here, we have developed a new method for the prediction of genome-wide RNA secondary structure profile from RNA sequence based on the extreme gradient boosting technique. The method achieves predictions with areas under the receiver operating characteristic curve (AUC) >0.9 on three different datasets, and AUC of 0.888 by another independent test on the recently released Zika virus data. These AUCs are consistently >5% greater than those by the CROSS method recently developed based on a shallow neural network. Further analysis on the 1000 Genome Project data showed that our predicted unpaired probabilities are highly correlated (>0.8) with the minor allele frequencies at synonymous, non-synonymous mutations, and mutations in untranslated regions, which were higher than those generated by RNAplfold. Moreover, the prediction over all human mRNA indicated a consistent result with previous observation that there is a periodic distribution of unpaired probability on codons. The accurate predictions by our method indicate that such model trained on genome-wide experimental data might be an alternative for analytical methods. AVAILABILITY AND IMPLEMENTATION: The GRASP is available for academic use at https://github.com/sysu-yanglab/GRASP. SUPPLEMENTARY INFORMATION: Supplementary data are available online.

Subjects

Zika Virus Infection; Zika Virus; Base Sequence; High-Throughput Nucleotide Sequencing; Humans; Neural Networks, Computer; RNA/genetics; Software
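
The abstract above evaluates per-nucleotide pairing predictions by the area under the ROC curve (AUC). A minimal sketch of AUC computed from its pairwise-ranking definition follows. The function name is hypothetical, and the quadratic loop is chosen for clarity rather than speed; it is not the evaluation code of the GRASP repository.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    labels -- 1 for positives (e.g. unpaired nucleotides), 0 for negatives
    scores -- predicted probabilities or scores; ties count as half a win
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

For genome-wide data a rank-based implementation would be preferable, since this version compares every positive with every negative.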

11.

Predicting Retrosynthetic Reactions Using Self-Corrected Transformer Neural Networks.

Zheng, Shuangjia; Rao, Jiahua; Zhang, Zhongyue; Xu, Jun; Yang, Yuedong.

J Chem Inf Model ; 60(1): 47-55, 2020 Jan 27.

Article in English | MEDLINE | ID: mdl-31825611

ABSTRACT

Synthesis planning is the process of recursively decomposing target molecules into available precursors. Computer-aided retrosynthesis can potentially assist chemists in designing synthetic routes; however, at present, it is cumbersome and cannot provide satisfactory results. In this study, we have developed a template-free self-corrected retrosynthesis predictor (SCROP) to predict retrosynthesis using transformer neural networks. In the method, the retrosynthesis planning was converted to a machine translation problem from the products to molecular linear notations of the reactants. By coupling with a neural network-based syntax corrector, our method achieved an accuracy of 59.0% on a standard benchmark data set, which outperformed other deep learning methods by >21% and template-based methods by >6%. More importantly, our method was 1.7 times more accurate than other state-of-the-art methods for compounds not appearing in the training set.

Subjects

Chemistry Techniques, Synthetic/methods; Neural Networks, Computer; Datasets as Topic
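
Framing retrosynthesis as machine translation, as the abstract above describes, requires splitting molecular linear notations (SMILES strings) into tokens before they reach a sequence model. The sketch below shows one simple way to do this. The regular expression and the function name `tokenize_smiles` are assumptions made for illustration; the abstract does not describe the tokenization that SCROP actually uses, and this pattern only special-cases bracket atoms, `Cl`, `Br`, and two-digit ring closures.

```python
import re

# Simplified token pattern (an assumption, not SCROP's tokenizer):
# bracket atoms, the two-letter halogens, %NN ring closures, then any single character.
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens that concatenate back to the input."""
    tokens = _TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("tokenization lost characters")
    return tokens
```

A product such as `CC(=O)Cl` then becomes the source sequence `C C ( = O ) Cl`, and the reactant SMILES would be tokenized the same way to form the target sequence.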
